METHOD FOR MONITORING FAULT OF OPERATING SYSTEM AND APPLICATION PROGRAM

The present application is a continuation of application Serial No. 09/642,769, filed August 22, 2000, the contents of which are incorporated herein by reference.

BACKGROUND OF THE INVENTION 

The present invention relates to a fault monitoring method for a computer, and in particular to a method for monitoring faults of an operating system (OS) and an application program (AP).

In general, an AP can stop because of a fault for several reasons. The main causes are defects in the AP itself and faults of the OS on which the AP is mounted. For an AP whose operation must be continued, the following method has been adopted: faults are monitored, and if a fault is detected, operation of the AP is resumed from the state it was in before the fault occurred, thereby recovering the AP from the fault.

One fault monitoring method for APs and OSs is called a watchdog. "Fault Tolerant Computer," written by Takashi NANYA and published by OHM-SHA, says: "The watchdog timer is a different process which is independent of a monitored process. The monitored process is so designed that a timer is reset at intervals of a fixed time (for example, in the range of several microseconds to several minutes) during the execution of the monitored process. If the timer is not reset until the time is up, some fault is considered to have occurred in the monitored process."
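The watchdog timer described in the quotation can be illustrated with a minimal Python sketch. The class name `WatchdogTimer`, its methods, and the injectable clock (used only so the behaviour can be checked deterministically) are illustrative assumptions, not part of the cited text.

```python
import time


class WatchdogTimer:
    """Hypothetical watchdog sketch: the monitored process must call reset()
    more often than every `timeout` seconds, otherwise expired() reports a fault."""

    def __init__(self, timeout: float, clock=time.monotonic):
        self.timeout = timeout
        self._clock = clock  # injectable clock (assumption) so the sketch is testable
        self._deadline = clock() + timeout

    def reset(self) -> None:
        # Called periodically by the monitored process during its execution.
        self._deadline = self._clock() + self.timeout

    def expired(self) -> bool:
        # True once the time is up without a reset: a fault is assumed.
        return self._clock() >= self._deadline


if __name__ == "__main__":
    fake_now = [0.0]
    wd = WatchdogTimer(timeout=5.0, clock=lambda: fake_now[0])
    fake_now[0] = 4.0
    wd.reset()            # reset in time: the deadline moves to 9.0
    fake_now[0] = 8.0
    print(wd.expired())   # False
    fake_now[0] = 9.5
    print(wd.expired())   # True: no reset since t = 4.0
```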
When a fault of an AP is monitored, an AP fault monitor having a watchdog receives periodic alive messages from the AP. If the alive messages stop for a predetermined time, the AP fault monitor judges that an AP fault has occurred and restarts the AP. When a fault of an OS is monitored, an OS fault monitor having a watchdog receives periodic alive messages from the OS. If the alive messages stop for a predetermined time, the OS fault monitor judges that an OS fault has occurred and restarts the OS.

Furthermore, a technique called check-point is used to recover an AP from a fault. In the check-point technique, execution information of the monitored AP is periodically acquired and reserved. When a fault occurs, the reserved execution information is taken out, and processing of the AP is resumed from the check-point.
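The check-point technique, and the work lost when the check-point period is long, can be illustrated with a small Python sketch. The `CheckpointedCounter` class and its dictionary of execution information are hypothetical stand-ins for a real AP; the check-point here is kept in ordinary process memory, so unlike real check-point storage it would not survive an actual process or OS failure.

```python
import copy


class CheckpointedCounter:
    """Hypothetical AP whose execution information is a small dict. It reserves
    a copy of that information every `period` steps and can resume from it."""

    def __init__(self, period: int):
        self.period = period
        self.state = {"step": 0, "total": 0}
        self._checkpoint = copy.deepcopy(self.state)  # reserved execution information

    def run_step(self) -> None:
        self.state["step"] += 1
        self.state["total"] += self.state["step"]
        if self.state["step"] % self.period == 0:
            # Periodically acquire and reserve the execution information.
            self._checkpoint = copy.deepcopy(self.state)

    def recover(self) -> None:
        # After a fault, take out the reserved information and resume from it.
        self.state = copy.deepcopy(self._checkpoint)


if __name__ == "__main__":
    ap = CheckpointedCounter(period=4)
    for _ in range(10):
        ap.run_step()
    ap.recover()       # simulate a fault after step 10
    print(ap.state)    # {'step': 8, 'total': 36}
```

With `period=4`, a fault after step 10 loses the two steps executed since the last check-point; a longer period loses correspondingly more work.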

A software-based AP fault monitor operates on the same OS as the monitored AP. In some cases, therefore, it cannot cope with an AP fault caused by the OS. There is also a known OS fault monitoring method in which a watchdog formed of dedicated hardware monitors periodic alive messages supplied from the OS. However, this method has the problem that dedicated hardware must be prepared.

As for the check-point technique, if check-point information is preserved in a low-speed memory, preserving it takes considerable time, and consequently the check-point repetition period becomes long. As a result, recovery from a fault must start from a state considerably earlier than the time at which the fault occurred. Alternatively, a high-speed non-volatile memory, such as a static RAM, can be provided apart from the volatile memory, such as a dynamic RAM, managed by the OS, and check-point information can be preserved in that high-speed non-volatile memory. However, this again has the problem that dedicated hardware must be prepared.

SUMMARY OF THE INVENTION 

An object of the present invention is to provide a method for monitoring a fault of an OS by using software, without adding dedicated hardware.

Another object of the present invention is to provide a method for monitoring a fault of an AP and preserving check-point information of the AP at high speed, without adding dedicated hardware.

The present invention solves the above-described problems. In accordance with the present invention, an operating system fault monitoring method is provided for a computer that includes a first OS, a second OS different from the first OS, a multi-OS controller that manages computer resources and has inter-OS communication means between the first OS and the second OS, and a fault monitor operating on the second OS. The method includes the steps of transmitting an alive message from the first OS to the fault monitor via the inter-OS communication means, and determining whether the fault monitor has received the alive message within a predetermined time.

In accordance with the present invention, an application program fault monitoring method is also provided for a computer that includes an AP fault monitor operating on the first OS and a high rank fault monitor operating on a second OS, the high rank fault monitor monitoring, via inter-OS communication means, not only faults of the first OS but also faults of the AP fault monitor. The method includes the step of monitoring a fault of the AP fault monitor operating on the first OS by using the high rank fault monitor. An AP monitored by the AP fault monitor preserves check-point information in a shared memory region on a main memory. Information in the shared memory region is preserved even when the first OS suffers a fault and is restarted.

According to the present invention, as described above, a fault of a monitored OS is monitored by utilizing a multi-OS environment and a high rank fault monitor operating on another OS. A fault of the OS can therefore be monitored without adding dedicated hardware. Furthermore, a fault of an AP fault monitor can also be monitored by the high rank fault monitor. Since the high rank fault monitor is implemented in software, the recovery method to be used when a fault occurs in the OS or in the AP fault monitor can be set in fine detail. In addition, a monitored AP can preserve check-point information at high speed without adding dedicated hardware.

BRIEF DESCRIPTION OF THE DRAWINGS 
FIG. 1 is a diagram showing the configuration of a computer forming an embodiment of the present invention;

FIG. 2 is a diagram showing the configuration of a storage region in a memory in the embodiment shown in FIG. 1;

FIG. 3 is a flow chart showing the procedure of start processing of each program in the embodiment shown in FIG. 1;

FIG. 4 is a flow chart showing the procedure of OS monitoring processing conducted by a high rank fault monitor 3 and recovery processing of related programs at the time of an OS fault, in the embodiment shown in FIG. 1;

FIG. 5 is a flow chart showing the procedure of monitoring processing of an AP fault monitor conducted by the high rank fault monitor 3 and recovery processing of related programs at the time of a fault of the AP fault monitor, in the embodiment shown in FIG. 1; and

FIG. 6 is a flow chart showing the procedure of monitoring processing of a monitored AP conducted by an AP fault monitor 4 and recovery processing of related programs at the time of a fault of the monitored AP, in the embodiment shown in FIG. 1.

DESCRIPTION OF THE PREFERRED EMBODIMENT 
Hereafter, an embodiment of the present invention will be described.

FIG. 1 is a diagram showing the configuration of a computer 100 in the present embodiment. The computer 100 can be divided broadly into a processor 10 and a memory (main storage) 9. An input/output device and a storage connected to the processor 10 are not illustrated. The memory 9 stores the programs of a first OS 1, a second OS 2, a high rank fault monitor (hereafter referred to as "high rank monitor") 3, an AP fault monitor (hereafter referred to as "AP monitor") 4, a monitored AP 5, and a multi-OS controller 6. These programs are executed by the processor 10. Furthermore, the memory 9 includes a shared memory region 11 which can be accessed by both the first OS 1 and the second OS 2.
The monitored AP 5 is an application mounted on the first OS 1. The monitored AP 5 may be, for example, a general business application program. The AP monitor 4 is an application mounted on the first OS 1; it is a program for conducting fault monitoring of the monitored AP 5. The second OS 2 is an OS different from the first OS 1, and is preferably an OS regarded as having higher reliability than the first OS 1. The high rank monitor 3 is an application mounted on the second OS 2; it is a program for conducting fault monitoring of the first OS 1 and the AP monitor 4.
The multi-OS controller 6 has the following functions.

(a) Hardware Resource Division Function 

The multi-OS controller 6 manages computer resources such as CPU time, the memory, external devices, and timers, and divides these resources among the respective OSs. For example, as for timers, the multi-OS controller 6 has a function of virtually providing a single hardware timer to each of the OSs. Furthermore, the multi-OS controller 6 has a function of reserving the I/O resources of the external devices used by each OS.

(b) Inter-OS Message Passing Function 

This function allows processes on different OSs to exchange messages. Unlike shared memory communication, message passing can call the process of the other communicating party. In the present embodiment, this message passing function is used to transmit alive messages.
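The difference between message passing and shared memory communication can be sketched as follows: delivering a message directly calls a handler registered by the receiving process, whereas a shared memory region simply holds data until someone reads it. `InterOSMessageChannel`, `register`, and `send` are hypothetical names; the actual interface of the multi-OS controller 6 is not specified in the text, and this single-process toy does not cross any OS boundary.

```python
from typing import Callable, Dict


class InterOSMessageChannel:
    """Toy stand-in (hypothetical) for the inter-OS message passing function:
    sending a message calls the receiving party's handler directly."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[dict], None]] = {}

    def register(self, endpoint: str, handler: Callable[[dict], None]) -> None:
        # The receiving process announces which function should be called.
        self._handlers[endpoint] = handler

    def send(self, endpoint: str, message: dict) -> bool:
        handler = self._handlers.get(endpoint)
        if handler is None:
            return False       # nobody registered on the other side
        handler(message)       # the other party is invoked, not polled
        return True


if __name__ == "__main__":
    inbox = []
    channel = InterOSMessageChannel()
    channel.register("high_rank_monitor", inbox.append)
    channel.send("high_rank_monitor", {"subject_id": 1, "alive": 1})
    print(inbox)   # [{'subject_id': 1, 'alive': 1}]
```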

(c) Inter-OS Shared Memory Communication Function 

The multi-OS controller 6 secures a memory region that can be shared between the OSs, as a physical memory region independent of the physical memory regions of the respective OSs. The shared memory region is divided into several blocks for management, and each block is given an identifier (ID). Each OS can access the shared memory by using this shared memory identifier (ID) or a pointer to the physical memory. Each OS initializes its memory after it has been started, on the assumption that no information is recorded there. For example, when an OS is restarted after a fault of the OS, the memory region in which the state of an application and so on had been preserved is initialized. The shared memory region, however, is not initialized because it lies outside the regions of the OSs, and its contents are preserved. After a restart, each OS can therefore use the shared memory region as a pseudo-non-volatile memory, in a software manner, by specifying the shared memory identifier (ID) used last time or a pointer to the physical memory. In the present embodiment, the shared memory region is used as a check-point information storage region of the monitored AP.
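The use of the shared memory region as pseudo-non-volatile check-point storage can be approximated on a single OS with Python's `multiprocessing.shared_memory` module, where the block name plays the role of the shared memory identifier (ID). This is only an analogy under stated assumptions: the region name, the two-integer record layout, and the helper functions are hypothetical, and the contents here outlive the attach/detach cycles that stand in for a restart of the AP, not a restart of the OS itself as the region secured by the multi-OS controller 6 does.

```python
import os
import struct
from multiprocessing import shared_memory

# Assumed record layout for one check-point: two signed 64-bit integers.
RECORD = struct.Struct("<qq")  # (step, total)


def save_checkpoint(region_name: str, step: int, total: int) -> None:
    shm = shared_memory.SharedMemory(name=region_name)  # attach by identifier
    try:
        RECORD.pack_into(shm.buf, 0, step, total)
    finally:
        shm.close()  # detach; the data stays in the region


def load_checkpoint(region_name: str) -> tuple:
    shm = shared_memory.SharedMemory(name=region_name)  # re-attach after "restart"
    try:
        return RECORD.unpack_from(shm.buf, 0)
    finally:
        shm.close()


if __name__ == "__main__":
    name = f"ckpt_{os.getpid()}"  # hypothetical shared memory identifier
    region = shared_memory.SharedMemory(name=name, create=True, size=RECORD.size)
    try:
        save_checkpoint(name, 8, 36)   # AP stores a check-point and detaches
        print(load_checkpoint(name))   # (8, 36)
    finally:
        region.close()
        region.unlink()                # release the region when no longer needed
```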

(d) OS Fault Notice and Fault Recovery Function 
The OS fault notice is a function of notifying the other OS that a fault has occurred in a certain OS. However, the only faults that can be notified are those of which the faulty OS itself is aware. There is also a function of recovering a given OS from a fault by restarting only that OS.

The monitored AP 5 periodically transmits an alive message to the AP monitor 4. The AP monitor 4 determines whether it has received an alive message from the monitored AP 5 within a predetermined fault detection time. If no alive message has been received, the AP monitor 4 judges that the monitored AP 5 has stopped because of a fault. The AP monitor 4 then conducts recovery processing of the faulty monitored AP 5 in accordance with a preset fault recovery method.

The first OS 1 and the AP monitor 4 each independently and periodically transmit alive messages to the high rank monitor 3 by utilizing the inter-OS message passing function of the multi-OS controller 6. The high rank monitor 3 determines whether it has received an alive message from each of the first OS 1 and the AP monitor 4 within a predetermined fault detection time. If no alive message has been received from one of them, the high rank monitor 3 judges that it has stopped because of a fault. The high rank monitor 3 then conducts recovery processing of the faulty first OS 1 or AP monitor 4 in accordance with a preset fault recovery method.

The monitored AP 5 periodically stores check-point information in the shared memory region 11 of the memory 9. When the monitored AP 5 is restarted because of a fault of the first OS 1, the AP monitor 4, or the monitored AP 5, the monitored AP 5 takes out the check-point information from the shared memory region 11 and resumes its processing from the check-point.

In the present embodiment, only the high rank monitor 3 is mounted on the second OS 2, so the second OS 2 need only have the functions required to respond to requests from the high rank monitor 3. Compared with the first OS 1, the second OS 2 can therefore be regarded as an OS with small CPU-time overhead and high reliability. Typically, the memory occupied by the second OS 2 is smaller than the memory occupied by the first OS 1.
The first OS 1, the second OS 2, the high rank monitor 3, and the multi-OS controller 6, or the first OS 1, the second OS 2, the high rank monitor 3, the AP monitor 4, and the multi-OS controller 6, can be stored in a storage medium. They can then be read into the memory 9 via a drive device connected to the processor 10, or transmitted to the memory 9 of another computer via a drive device connected to a computer, that computer, and a network, and executed by the processor 10.

FIG. 2 is a diagram showing the configuration of a storage region in the memory 9. Program regions and the like are not illustrated; only the data configuration of the regions that store control information and the like concerning the present invention is shown.

High rank monitor control information 12 is placed in a memory region of the second OS 2 and accessed by the high rank monitor 3. The control information 12 stores an OS fault detection time 13 and an OS fault recovery method 14 in association with a monitored subject ID 18, and an AP monitor fault detection time 15 and an AP monitor fault recovery method 16 in association with a monitored subject ID 20. The monitored subject ID 18 is an identifier given to the first OS 1. The OS fault detection time 13 is the time allowed for an alive message to arrive from the first OS 1 when the high rank monitor 3 monitors the first OS 1 for faults. The OS fault recovery method 14 indicates the recovery method to be used at the time of a fault of the first OS 1; it is selected from options such as OS restart without a memory dump, OS restart after acquiring a memory dump, restart of only a specific function of the OS, and so on. The monitored subject ID 20 is an identifier given to the AP monitor 4. The AP monitor fault detection time 15 is the time allowed for an alive message to arrive from the AP monitor 4 when the high rank monitor 3 monitors the AP monitor 4 for faults. The AP monitor fault recovery method 16 indicates the method to be used when the AP monitor 4 is recovered from a fault; it is selected from options such as restart of the AP monitor 4 and the monitored AP 5, restart of the first OS 1, the AP monitor 4, and the monitored AP 5, and so on.

Since the AP monitor 4 is considered to be more reliable than the first OS 1, a fault of the AP monitor 4 is in some cases caused by a fault of the first OS 1. In such a case, a series of restart operations beginning with restart of the first OS 1 followed by restart of the AP monitor 4 is sometimes effective.

First OS control information 17 is placed in a memory region of the first OS 1 and accessed by the AP monitor 4. The monitored subject ID 18 is stored as the first OS control information 17, together with the control information required to restart the first OS 1.

AP monitor control information 19 is placed in the memory region of the first OS 1 and accessed by the AP monitor 4. The monitored subject ID 20, an AP fault detection time 21, and an AP fault recovery method 22 are stored as the AP monitor control information 19. The AP fault detection time 21 is the time allowed for an alive message to arrive from the monitored AP 5 when the AP monitor 4 monitors the monitored AP 5 for faults. The AP fault recovery method 22 indicates the method to be used when the AP monitor 4 recovers the monitored AP 5 from a fault; it is selected from options such as restart of only the monitored AP 5, restart of the AP monitor 4 and the monitored AP 5, restart of the first OS 1, the AP monitor 4, and the monitored AP 5, and so on. For the same reason as in the recovery of the AP monitor 4, a series of restart operations beginning with restart of the first OS 1 followed by restart of the monitored AP 5 is sometimes effective.
Monitored AP control information 23 is placed in the shared memory region 11 and accessed by the monitored AP 5. The identifier of the head block of a check-point region 25 is stored as a check-point identifier ID 24. The check-point region 25 is a region, placed in the shared memory region 11 and accessed by the monitored AP 5, for storing check-point information.
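The control information of FIG. 2 can be modelled with Python dataclasses and enumerations. Field names, the use of seconds for the detection times, and the enumeration members (which cover only the options the text names explicitly, leaving out those covered by "and so on") are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto


class OSRecovery(Enum):              # OS fault recovery method 14
    RESTART_WITHOUT_DUMP = auto()
    RESTART_AFTER_DUMP = auto()
    RESTART_SPECIFIC_FUNCTION = auto()


class APMonitorRecovery(Enum):       # AP monitor fault recovery method 16
    RESTART_FROM_AP_MONITOR = auto()  # AP monitor 4, then monitored AP 5
    RESTART_FROM_FIRST_OS = auto()    # first OS 1, AP monitor 4, monitored AP 5


class APRecovery(Enum):              # AP fault recovery method 22
    RESTART_AP_ONLY = auto()
    RESTART_FROM_AP_MONITOR = auto()
    RESTART_FROM_FIRST_OS = auto()


@dataclass
class MonitoredEntry:                # one entry of high rank monitor control information 12
    subject_id: int                  # monitored subject ID 18 or 20
    detection_time_s: float          # fault detection time 13 or 15 (seconds assumed)
    recovery: Enum                   # recovery method 14 or 16


@dataclass
class APMonitorControl:              # AP monitor control information 19
    subject_id: int                  # monitored subject ID 20
    ap_detection_time_s: float       # AP fault detection time 21
    ap_recovery: APRecovery          # AP fault recovery method 22


@dataclass
class MonitoredAPControl:            # monitored AP control information 23 (shared region 11)
    checkpoint_id: int               # check-point identifier ID 24: head block of region 25
```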

FIG. 3 is a flow chart showing the procedure of start processing of the respective programs conducted when the multi-OS computer 100 of the present embodiment is started. After the processor 10 is started, the first OS 1, the second OS 2, and the multi-OS controller 6 are started by execution of an automatic procedure (step 31). Upon being started, the multi-OS controller 6 divides computer resources, such as CPU time, memory, and external devices, between the first OS 1 and the second OS 2 by referring to a configuration management file (not illustrated). Subsequently, the second OS 2 starts the high rank monitor 3 (step 32). The start procedure is the same as the ordinary procedure for starting an AP on the second OS 2, and is conducted either manually or automatically by executing a command line in a file.

Subsequently, the first OS 1 transmits an OS fault detection time and an OS fault recovery method to the high rank monitor 3 via the multi-OS controller 6 and the second OS 2 by utilizing the inter-OS message passing function of the multi-OS controller 6. The high rank monitor 3 receives this information and sets it in the memory 9 as the OS fault detection time 13 and the OS fault recovery method 14 (step 33). The high rank monitor 3 issues a monitored subject ID (for example, 01) and associates it with the OS fault detection time 13 and the OS fault recovery method 14 in memory. In addition, the high rank monitor 3 transmits the issued monitored subject ID to the first OS 1 via the second OS 2 and the multi-OS controller 6 by utilizing the inter-OS message passing function. The first OS 1 receives it and sets it in the memory 9 as the monitored subject ID 18 of the first OS control information 17. Thereafter, the high rank monitor 3 starts monitoring the first OS 1 by referring to the OS fault detection time 13 (step 34). In other words, the first OS 1 transmits an alive message to the high rank monitor 3 via the multi-OS controller 6 and the second OS 2 by using the inter-OS message passing function, and the high rank monitor 3 receives the alive message and determines whether it has been received within the OS fault detection time 13.

Subsequently, the first OS 1 starts the AP monitor 4 (step 35). The start procedure is the same as the ordinary procedure for starting an AP on the first OS 1; the AP monitor 4 is started either manually or automatically by executing a command line in a file. The AP monitor 4 then transmits an AP monitor fault detection time and an AP monitor fault recovery method to the high rank monitor 3 via the first OS 1, the multi-OS controller 6, and the second OS 2 by utilizing the inter-OS message passing function. The high rank monitor 3 receives this information and sets it in the memory 9 as the AP monitor fault detection time 15 and the AP monitor fault recovery method 16 of the high rank monitor control information 12 (step 36). The high rank monitor 3 issues a monitored subject ID (for example, 02) and associates it with the AP monitor fault detection time 15 and the AP monitor fault recovery method 16 in memory. In addition, the high rank monitor 3 transmits the issued monitored subject ID to the AP monitor 4 via the second OS 2, the multi-OS controller 6, and the first OS 1 by utilizing the inter-OS message passing function. The AP monitor 4 receives it and sets it in the memory 9 as the monitored subject ID 20 of the AP monitor control information 19. Thereafter, the high rank monitor 3 starts monitoring the AP monitor 4 by referring to the AP monitor fault detection time 15 (step 37). In other words, the AP monitor 4 transmits an alive message to the high rank monitor 3 via the first OS 1, the multi-OS controller 6, and the second OS 2, and the high rank monitor 3 receives the alive message and determines whether it has been received within the AP monitor fault detection time 15.

Subsequently, the first OS 1 starts the monitored AP 5 (step 38). The monitored AP 5 requests the regions for the check-point identifier ID 24 and the check-point region 25 from the multi-OS controller 6 via the first OS 1 and acquires them (step 39). The monitored AP 5 then sets the identifier ID of the check-point region 25 in the check-point identifier ID 24 of the monitored AP control information 23. Subsequently, the monitored AP 5 transmits an AP fault detection time and an AP fault recovery method to the AP monitor 4 by utilizing the inter-process communication function of the first OS 1. The AP monitor 4 receives this information and sets it in the memory 9 as the AP fault detection time 21 and the AP fault recovery method 22 of the AP monitor control information 19 (step 40). Thereafter, the AP monitor 4 starts monitoring the monitored AP 5 by referring to the AP fault detection time 21 of the AP monitor control information 19 (step 41). In other words, the monitored AP 5 transmits an alive message to the AP monitor 4 by using the inter-process communication function, and the AP monitor 4 receives the alive message and determines whether it has been received within the AP fault detection time 21.

The monitored AP 5 then proceeds with its processing while periodically storing check-point information in the check-point region 25 indicated by the check-point identifier ID 24.
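The registration handshake of steps 33-34 and 36-37, in which a subject reports its fault detection time and recovery method and receives a monitored subject ID in return, might look as follows. `HighRankMonitorRegistry` and its methods are hypothetical; message transport through the multi-OS controller 6 is omitted, and issuing IDs from a simple counter is an assumption. The `clear` method corresponds to the clearing of entries before they are set again after a restart.

```python
import itertools
import time
from typing import Dict, Tuple


class HighRankMonitorRegistry:
    """Hypothetical sketch: subjects register (detection time, recovery method)
    and receive a monitored subject ID; monitoring starts at registration."""

    def __init__(self, clock=time.monotonic) -> None:
        self._clock = clock
        self._ids = itertools.count(1)                 # issues 1 (01), 2 (02), ...
        self.table: Dict[int, Tuple[float, str]] = {}  # ID -> (detection time, method)
        self.last_alive: Dict[int, float] = {}         # ID -> time the timer was reset

    def register(self, detection_time_s: float, recovery_method: str) -> int:
        subject_id = next(self._ids)
        self.table[subject_id] = (detection_time_s, recovery_method)
        self.last_alive[subject_id] = self._clock()    # start monitoring now
        return subject_id                              # sent back to the subject

    def clear(self, subject_id: int) -> None:
        # Remove the entry before the subject re-registers after a restart.
        self.table.pop(subject_id, None)
        self.last_alive.pop(subject_id, None)
```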

FIG. 4 is a flow chart showing the procedure of OS monitoring processing conducted by the high rank monitor 3 and recovery processing of related programs at the time of an OS fault. The first OS 1 periodically executes a program that transmits an alive message to the high rank monitor 3 by utilizing the inter-OS message passing function, and thereby automatically transmits alive messages to the high rank monitor 3. The transmission period of the alive message does not exceed the OS fault detection time 13. The alive message includes the monitored subject ID 18 and a one-bit signal (1 or 0) indicating an alive or dead state. By referring to the monitored subject ID 18 in the high rank monitor control information 12, the high rank monitor 3 determines whether an alive message has come from the first OS 1 (step 51). When the high rank monitor 3 has received an alive message indicating the alive state (YES at step 51), a timer is reset and the processing returns to step 51. If no alive message indicating the alive state is received (NO at step 51) and the OS fault detection time 13 has not elapsed (NO at step 52), the processing returns to step 51.
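One pass of the FIG. 4 loop (steps 51 and 52) can be sketched as a pure function over the alive message and a per-subject timer. The names are hypothetical, and treating an explicit dead bit (0) as an immediate fault is an assumption suggested by the later description of the AP monitor 4 reporting its own death; the flow chart itself only distinguishes alive messages from their absence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AliveMessage:
    subject_id: int   # monitored subject ID, e.g. 18 for the first OS 1
    alive: int        # one-bit signal: 1 = alive, 0 = dead


@dataclass
class SubjectState:
    subject_id: int
    detection_time_s: float   # e.g. OS fault detection time 13
    timer_start: float        # time at which the timer was last reset


def monitor_step(state: SubjectState, msg: Optional[AliveMessage], now: float) -> str:
    """One pass of steps 51-52; returns "alive", "waiting" or "fault"."""
    if msg is not None and msg.subject_id == state.subject_id:
        if msg.alive == 1:
            state.timer_start = now   # YES at step 51: reset the timer
            return "alive"
        return "fault"                # assumption: explicit dead bit means recover now
    # Messages for other subjects (or none at all) fall through to the timer check.
    if now - state.timer_start < state.detection_time_s:
        return "waiting"              # NO at step 52: back to step 51
    return "fault"                    # YES at step 52: proceed to step 53
```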



If the OS fault detection time 13 has elapsed (YES at step 52), the high rank monitor 3 judges that the first OS 1 has stopped because of a fault, and conducts processing for recovery from the OS fault in accordance with the OS fault recovery method 14 (step 53). The high rank monitor 3 requests the multi-OS controller 6 to conduct recovery processing of the first OS 1, and the multi-OS controller 6 conducts it. If fault information such as a memory dump of the first OS 1 is recorded on a disk or the like, it can later be used to investigate the cause of the fault of the first OS 1. In this case, however, restart of the first OS 1 is delayed, so there is a trade-off between the required restart time of the first OS 1 and the merit of recording. When the recovery processing is finished, the first OS 1 is restarted and its control information is reset (step 54). Even if a fault occurs in the first OS 1 and the first OS 1 is restarted, the high rank monitor control information 12 located in the memory region of the second OS 2 is preserved. However, because the high rank monitor 3 needs to detect when the restart of the first OS 1 has completed, and because the OS fault recovery method 14 needs to be changeable, the high rank monitor 3 clears the OS fault detection time 13, the OS fault recovery method 14, and the monitored subject ID 18, and sets them again in accordance with the procedure described above. The high rank monitor 3 transmits the monitored subject ID 18 to the first OS 1 and then restarts fault monitoring of the first OS 1.

When the first OS 1 is restarted because of the OS fault detection, the AP monitor 4 is also forcibly stopped. Therefore, the AP monitor 4 is restarted and its control information is reset (step 55). In the same way, the high rank monitor 3 clears the AP monitor fault detection time 15, the AP monitor fault recovery method 16, and the monitored subject ID 20, and sets them again in accordance with the procedure described above. The high rank monitor 3 transmits the monitored subject ID 20 to the AP monitor 4 and then restarts fault monitoring of the AP monitor 4.

Subsequently, the high rank monitor 3 restarts the monitored AP 5 and resets its control information (step 56). The monitored AP 5 recognizes that it has been restarted and resets the AP fault detection time 21 and the AP fault recovery method 22 in accordance with the procedure described above. The monitored AP 5 then acquires the check-point information from the check-point region 25 indicated by the check-point identifier ID 24 (step 57). The AP monitor 4 resets the AP fault detection time 21 and the AP fault recovery method 22, and then restarts fault monitoring of the monitored AP 5 (step 58). As described above, the monitored AP 5 thereafter periodically stores check-point information in the check-point region 25.

FIG. 5 is a flow chart showing the procedure of monitoring processing of the AP monitor 4 conducted by the high rank monitor 3 and recovery processing of related programs at the time of an AP monitor fault. The AP monitor 4 calls a function that transmits an alive message to the high rank monitor 3, and thereby periodically transmits alive messages to the high rank monitor 3. The transmission period of the alive message does not exceed the AP monitor fault detection time 15. The alive message includes the monitored subject ID 20 and a one-bit signal (1 or 0) indicating an alive or dead state. By referring to the monitored subject ID 20 in the high rank monitor control information 12, the high rank monitor 3 determines whether an alive message has come from the AP monitor 4 (step 61). When the high rank monitor 3 has received an alive message indicating the alive state (YES at step 61), a timer is reset and the processing returns to step 61. If no alive message indicating the alive state is received (NO at step 61) and the AP monitor fault detection time 15 has not elapsed (NO at step 62), the processing returns to step 61.

If the AP monitor fault detection time 15 has elapsed (YES at step 62), the high rank monitor 3 judges that the AP monitor 4 has stopped because of a fault, and conducts processing for recovery of the AP monitor 4 in accordance with the AP monitor fault recovery method 16. Here, assuming that the AP monitor fault recovery method 16 specifies restart beginning from the AP monitor 4, the AP monitor 4 is restarted and its control information is reset (step 63). The high rank monitor 3 clears the AP monitor fault detection time 15, the AP monitor fault recovery method 16, and the monitored subject ID 20, and sets them again in accordance with the procedure described above. The high rank monitor 3 transmits the monitored subject ID 20 to the AP monitor 4 and then restarts fault monitoring of the AP monitor 4.

Because the AP monitor 4 has been restarted, the start settings of the monitored AP 5 must be made again. Therefore, the monitored AP 5 is restarted and its control information is reset (step 64). The monitored AP 5 recognizes that it has been restarted and resets the AP fault detection time 21 and the AP fault recovery method 22 in accordance with the procedure described above. The monitored AP 5 then acquires the check-point information from the check-point region 25 indicated by the check-point identifier ID 24 (step 65). The AP monitor 4 resets the AP fault detection time 21 and the AP fault recovery method 22, and then restarts fault monitoring of the monitored AP 5 (step 66). As described above, the monitored AP 5 thereafter periodically stores check-point information in the check-point region 25.

If restart beginning from the first OS 1 is set in the AP monitor fault recovery method 16, the high rank monitor 3 conducts the processing of step 53, and the processing of steps 54 to 58 is then conducted.

FIG. 6 is a flow chart showing the procedure of fault monitoring processing of the monitored AP 5 conducted by the AP monitor 4 and recovery processing of related programs at the time of a monitored AP fault. The monitored AP 5 calls a function that transmits an alive message to the AP monitor 4, and thereby periodically transmits alive messages to the AP monitor 4. The transmission period of the alive message does not exceed the AP fault detection time 21. The alive message includes a one-bit signal indicating an alive or dead state. The AP monitor 4 determines whether an alive message has come from the monitored AP 5 (step 71). When the AP monitor 4 has received an alive message indicating the alive state (YES at step 71), a timer is reset and the processing returns to step 71. If no alive message indicating the alive state is received (NO at step 71) and the AP fault detection time 21 has not elapsed (NO at step 72), the processing returns to step 71.
If the AP fault detection time 21 has elapsed (YES at step 72), the AP monitor 4 judges that the monitored AP 5 has stopped because of a fault, and conducts processing for recovery of the monitored AP 5 in accordance with the AP fault recovery method 22. Here, assuming that the AP fault recovery method 22 specifies restart of only the monitored AP 5, the monitored AP 5 is restarted and its control information is reset (step 73). The monitored AP 5 recognizes that it has been restarted and resets the AP fault detection time 21 and the AP fault recovery method 22 in accordance with the procedure described above. The monitored AP 5 then acquires the check-point information from the check-point region 25 indicated by the check-point identifier ID 24 (step 74). The AP monitor 4 resets the AP fault detection time 21 and the AP fault recovery method 22, and then restarts fault monitoring of the monitored AP 5 (step 75). As described above, the monitored AP 5 thereafter periodically stores check-point information in the check-point region 25.

If both the AP fault recovery method 22 and the AP monitor fault recovery method 16 specify restart beginning from the AP monitor 4, the AP monitor 4 transmits an alive message indicating its own death to the high rank monitor 3, whereupon the processing of steps 63 to 66 is conducted. If both the AP fault recovery method 22 and the AP monitor fault recovery method 16 specify a series of restart operations beginning with restart of the first OS 1 followed by restart of the AP monitor 4, the high rank monitor 3 conducts the processing of step 53, and the processing of steps 54 to 58 is then conducted.
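The choice of restart sequence described for FIGS. 4 to 6 can be summarised as a small dispatcher. This is a sketch under assumptions: component names are hypothetical strings, each recovery method is reduced to the component the restart begins from, the OS-level options of the OS fault recovery method 14 (dump or no dump) are not modelled, and when the monitored AP 5 fails the sketch follows the AP fault recovery method 22 alone, whereas the text only describes cases in which methods 22 and 16 agree.

```python
from typing import List

# Restart chain from the top of the hierarchy downwards (hypothetical names).
CHAIN: List[str] = ["first_os", "ap_monitor", "monitored_ap"]


def restart_sequence(start_from: str) -> List[str]:
    """Components restarted, in order, when recovery begins at `start_from`.
    Everything below the starting point is restarted as well, because its OS
    or its monitor has been restarted (compare steps 54-58 and 63-66)."""
    return CHAIN[CHAIN.index(start_from):]


def plan_recovery(failed: str, ap_method: str, ap_monitor_method: str) -> List[str]:
    """`ap_method` stands for the AP fault recovery method 22 and
    `ap_monitor_method` for the AP monitor fault recovery method 16, each
    given as the component the restart begins from."""
    if failed == "first_os":
        return restart_sequence("first_os")           # steps 53-58
    if failed == "ap_monitor":
        return restart_sequence(ap_monitor_method)    # steps 63-66, or 53-58
    if failed == "monitored_ap":
        return restart_sequence(ap_method)            # steps 73-75, or escalation
    raise ValueError(f"unknown component: {failed}")


if __name__ == "__main__":
    print(plan_recovery("monitored_ap", "monitored_ap", "ap_monitor"))
    print(plan_recovery("monitored_ap", "ap_monitor", "ap_monitor"))
    print(plan_recovery("ap_monitor", "monitored_ap", "first_os"))
```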